fix(viz): improve dtype inference logic by villebro · Pull Request #12933 · apache/superset

villebro · 2021-02-03T23:03:36Z

SUMMARY

Currently, query result formats are checked based on the dtype property of the DataFrame coming back from the database. This works for some types, but not for:

- columns that contain nulls (a known limitation in Pandas that turns an otherwise regular int column into object)
- dates (always object)
- UTC-based timestamps (on e.g. BigQuery all timestamps had the dtype datetime64[ns, UTC] when the dtype was cast to a string, so they were returned as a non-temporal type)
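
The three cases above can be reproduced with a small sketch. It assumes the database driver returns plain Python objects (so the null-containing column is built with dtype=object here), and the sample values are made up for illustration. It contrasts the dtype property with the label returned by pandas.api.types.infer_dtype.

```python
from datetime import date

import pandas as pd
from pandas.api.types import infer_dtype

# Assumption: columns arrive as Python objects, as they would from a DB cursor.
columns = {
    "int_with_null": pd.Series([1, None, 3], dtype=object),
    "date": pd.Series([date(2021, 2, 3), date(2021, 2, 4)]),
    "utc_timestamp": pd.Series(pd.to_datetime(["2021-02-03T23:03:36Z"], utc=True)),
}

# Label inferred from the values themselves, ignoring nulls.
inferred = {name: infer_dtype(col, skipna=True) for name, col in columns.items()}

for name, col in columns.items():
    print(f"{name}: dtype={col.dtype}, inferred={inferred[name]}")
```

The first two columns report object as their dtype and the third a timezone-qualified dtype string, whereas infer_dtype labels them integer, date and datetime64 respectively.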

This PR replaces the current inference logic with the infer_dtype utility offered by Pandas and adds tests for typical column types. This is really just a quick fix to make table chart column types work properly, as we're currently inferring column datatypes in three (!) different places in the codebase. A bigger refactor uniting these into one place needs to be done ASAP, but is left to a later SIP/PR. This also bumps superset-ui to the latest version which includes a fix for the default temporal format: apache-superset/superset-ui#937

AFTER

This screenshot shows TIMESTAMP, DATETIME and DATE columns from a BigQuery dataset in a table chart with this fix

BEFORE

The same chart prior to the PR

TEST PLAN

Local testing on various databases + new tests

ADDITIONAL INFORMATION

villebro

some annotations

villebro · 2021-02-03T23:07:12Z

tests/utils_tests.py

Here we're making sure that all main datatypes are correctly identified both with and without null values.
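
A rough sketch of what such a check can look like, not the actual test from this PR: each sample column is evaluated once as-is and once with a None appended, and the label from infer_dtype is expected to stay the same. The SAMPLES table, its values and the inferred_label helper are hypothetical.

```python
from datetime import date

import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical sample columns: expected infer_dtype label -> values without nulls.
SAMPLES = {
    "integer": [1, 2, 3],
    "string": ["a", "b", "c"],
    "boolean": [True, False, True],
    "date": [date(2021, 2, 3), date(2021, 2, 4)],
}


def inferred_label(values: list) -> str:
    """Infer the type label of an object-typed column, skipping nulls."""
    return infer_dtype(pd.Series(values, dtype=object), skipna=True)


# For every sample, record the label without and with a trailing null.
results = {
    expected: (inferred_label(values), inferred_label(values + [None]))
    for expected, values in SAMPLES.items()
}

for expected, (without_null, with_null) in results.items():
    print(f"{expected}: without null={without_null}, with null={with_null}")
```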

superset/utils/core.py

codecov-io · 2021-02-03T23:45:28Z

Codecov Report

Merging #12933 (b5f54f9) into master (9982fde) will decrease coverage by 2.24%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master   #12933      +/-   ##
==========================================
- Coverage   69.14%   66.90%   -2.25%     
==========================================
  Files        1025      489     -536     
  Lines       48767    28674   -20093     
  Branches     5188        0    -5188     
==========================================
- Hits        33718    19183   -14535     
+ Misses      14915     9491    -5424     
+ Partials      134        0     -134

| Flag | Coverage Δ | |
|---|---|---|
| cypress | `?` | |
| javascript | `?` | |
| python | `66.90% <100.00%> (-0.72%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| superset/common/query_context.py | `82.14% <100.00%> (-0.45%)` | ⬇️ |
| superset/utils/core.py | `88.15% <100.00%> (+0.11%)` | ⬆️ |
| superset/db_engines/hive.py | `0.00% <0.00%> (-85.72%)` | ⬇️ |
| superset/sql_validators/postgres.py | `50.00% <0.00%> (-50.00%)` | ⬇️ |
| superset/db_engine_specs/hive.py | `73.84% <0.00%> (-16.93%)` | ⬇️ |
| superset/databases/commands/create.py | `83.67% <0.00%> (-8.17%)` | ⬇️ |
| superset/databases/commands/update.py | `85.71% <0.00%> (-8.17%)` | ⬇️ |
| superset/db_engine_specs/sqlite.py | `90.62% <0.00%> (-6.25%)` | ⬇️ |
| superset/connectors/sqla/models.py | `84.51% <0.00%> (-6.04%)` | ⬇️ |
| superset/databases/commands/test_connection.py | `84.78% <0.00%> (-4.35%)` | ⬇️ |

... and 548 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9982fde...b5f54f9.

ktmud · 2021-02-03T23:49:10Z

I think we can first do a mapping based on the types returned by pandas.api.types.infer_dtype (which is much more efficient since it is implemented in Cython), then manually infer for mixed types when necessary.

ktmud

from typing import List

import pandas as pd
from pandas.api.types import infer_dtype

# GenericDataType is assumed to be in scope (e.g. defined in superset/utils/core.py).
# Keys are labels returned by infer_dtype; unmapped labels fall back to STRING.
PANDAS_DTYPE_MAP = {
    "date": GenericDataType.TEMPORAL,
    "time": GenericDataType.TEMPORAL,
    "datetime": GenericDataType.TEMPORAL,
    "datetime64": GenericDataType.TEMPORAL,
    "integer": GenericDataType.NUMERIC,
    "floating": GenericDataType.NUMERIC,
    "decimal": GenericDataType.NUMERIC,
    "mixed-integer-float": GenericDataType.NUMERIC,
    "boolean": GenericDataType.BOOLEAN,
}


def extract_dataframe_dtypes(df: pd.DataFrame) -> List[GenericDataType]:
    """Serialize pandas/numpy dtypes to generic types"""
    return [
        PANDAS_DTYPE_MAP.get(infer_dtype(df[col], skipna=True), GenericDataType.STRING)
        for col in df.columns
    ]

Maybe manual inference of more complex types isn't even needed, since infer_dtype can already take care of NAs for us.
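
To try the suggested mapping outside Superset, the following self-contained sketch may help. It uses a stand-in GenericDataType enum (the member names mirror those in the snippet above, the numeric values are an assumption) and made-up sample columns, most of which contain nulls.

```python
from datetime import date
from enum import IntEnum
from typing import List

import pandas as pd
from pandas.api.types import infer_dtype


class GenericDataType(IntEnum):
    """Stand-in for Superset's enum; the numeric values are assumed."""

    NUMERIC = 0
    STRING = 1
    TEMPORAL = 2
    BOOLEAN = 3


# Labels returned by infer_dtype mapped to generic types.
PANDAS_DTYPE_MAP = {
    "date": GenericDataType.TEMPORAL,
    "time": GenericDataType.TEMPORAL,
    "datetime": GenericDataType.TEMPORAL,
    "datetime64": GenericDataType.TEMPORAL,
    "integer": GenericDataType.NUMERIC,
    "floating": GenericDataType.NUMERIC,
    "decimal": GenericDataType.NUMERIC,
    "mixed-integer-float": GenericDataType.NUMERIC,
    "boolean": GenericDataType.BOOLEAN,
}


def extract_dataframe_dtypes(df: pd.DataFrame) -> List[GenericDataType]:
    """Map each column to a generic type, defaulting to STRING."""
    return [
        PANDAS_DTYPE_MAP.get(infer_dtype(df[col], skipna=True), GenericDataType.STRING)
        for col in df.columns
    ]


# Made-up sample data; object dtype mimics values coming from a DB cursor.
df = pd.DataFrame(
    {
        "int_null": pd.Series([1, None, 3], dtype=object),
        "float": pd.Series([0.5, 1.5, 2.5]),
        "date_null": pd.Series([date(2021, 2, 3), None, date(2021, 2, 4)], dtype=object),
        "bool_null": pd.Series([True, None, False], dtype=object),
        "str_null": pd.Series(["a", None, "b"], dtype=object),
        "mixed_str_int": pd.Series(["a", 1, 2], dtype=object),
    }
)
generic_types = extract_dataframe_dtypes(df)
print([t.name for t in generic_types])
```

Labels that are not in the map, such as the ones infer_dtype reports for mixed columns, fall back to STRING. That fallback is where a manual check for mixed columns could be added if it turns out to be needed.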

tests/utils_tests.py

junlincc

is this ^ normal? in postgresql @ktmud @villebro

junlincc · 2021-02-04T02:43:07Z

this one is in mysql ^
the scope of this fix seems large. @ktmud please help

villebro · 2021-02-04T05:59:17Z

> I think we can first do a mapping based on the types returned by pandas.api.types.infer_dtype (which is much more efficient since it is implemented in Cython), then manually infer for mixed types when necessary.

Good idea, I'll update to use this logic

villebro · 2021-02-04T06:00:12Z

> is this ^ normal? in postgresql

Let me check this, too

superset/utils/core.py

villebro · 2021-02-04T14:32:26Z

@junlincc the wacky formats in the table chart have now been fixed as of superset-ui version 0.17.6 (bumped in this PR). Regarding the results and samples tabs below: previously it wasn't possible to apply formatting to any column other than __timestamp, so this isn't a regression. As of #10270 we're getting column types in the result payload, so we can now apply formatting. I have a PR mostly ready that does this, but I'd rather leave it for a separate PR to minimize regression risk and keep this PR isolated as a fix (the one that adds proper formatting to results and samples will be a new feature).

ktmud

LGTM. Agreed the fix for data panel should be in a separate PR.

superset-github-bot bot added the preset-io label Feb 3, 2021

pull-request-size bot added the size/L label Feb 3, 2021

villebro force-pushed the villebro/fix-table-date-format branch from 20a4423 to 944339a Compare February 3, 2021 23:04

villebro commented Feb 3, 2021

villebro requested a review from ktmud February 3, 2021 23:12

junlincc added rush! Requires immediate attention #bug:blocking! Blocking issues with high priority hold:testing! On hold for testing labels Feb 3, 2021

ktmud reviewed Feb 4, 2021

tests/utils_tests.py

junlincc self-requested a review February 4, 2021 02:35

junlincc reviewed Feb 4, 2021

pull-request-size bot added size/M and removed size/L labels Feb 4, 2021

ktmud reviewed Feb 4, 2021

superset/utils/core.py

ktmud reviewed Feb 4, 2021

superset/utils/core.py

villebro mentioned this pull request Feb 4, 2021

fix(table-chart): bump superset-ui to 0.17.6 #12943

Closed

6 tasks

villebro added 5 commits February 4, 2021 15:26

fix(viz): improve dtype inference logic

585b249

use infer_dtype and add bool to testcase

30e60a8

fix mixed logic, add tests, refactor

6a27c33

remove redundant string type

86681d9

bump superset-ui packages

b5f54f9

villebro force-pushed the villebro/fix-table-date-format branch from 4959d52 to b5f54f9 Compare February 4, 2021 13:41

ktmud approved these changes Feb 4, 2021

ktmud merged commit ac73991 into apache:master Feb 4, 2021

villebro deleted the villebro/fix-table-date-format branch February 4, 2021 19:07

villebro added a commit to preset-io/superset that referenced this pull request Feb 4, 2021

fix(viz): improve dtype inference logic (apache#12933)

87dce0c

junlincc removed the hold:testing! On hold for testing label Feb 4, 2021

junlincc removed #bug:blocking! Blocking issues with high priority rush! Requires immediate attention labels Mar 15, 2021

mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.2.0 First shipped in 1.2.0 labels Mar 12, 2024
